We study the behavior of network diffusions based on the PageRank random walk from a set
of seed nodes. These diffusions are known to reveal small, localized clusters (or communities) 
and also large macro-scale clusters by varying a parameter that has a dual-interpretation 
as an accuracy bound and as a regularization level. We propose a new method that quickly 
approximates the result of the diffusion for all values of this parameter. Our method 
efficiently generates an approximate solution path or regularization path associated with a 
PageRank diffusion, and it reveals cluster structures at multiple size-scales between small 
and large. We formally prove a runtime bound on this method that is independent of the 
size of the network, and we investigate multiple optimizations to our method that can 
be more practical in some settings. We demonstrate that these methods identify refined 
clustering structure on a number of real-world networks with up to 2 billion edges. 

Key Words: 05C81 Random walks on graphs; 05C50 Graphs and linear algebra (matrices,
eigenvalues, etc.); 90C35 Programming involving graphs or networks; 91D30 Social networks; 
05C82 Small world graphs, complex networks 


1 Introduction 


Networks describing complex technological and social systems display many types of 
structure. One of the most important types of structure is clustering because it reveals the 
modules of technological systems and communities within social systems. A tremendous 
number of methods and objectives have been proposed for this task (survey articles
include refs. [26, 30]). The vast majority of these methods seek large regions of the graph
that display evidence of local structure. For the case of modularity clustering, methods
seek statistically anomalous regions; for the case of conductance clustering, methods seek 
dense regions that are weakly connected to the rest of the graph. All of the objective 
functions designed for these clustering approaches implicitly or explicitly navigate a 
trade-off between cluster size and the underlying clustering signal. For example, large 
sets tend to be more anomalous than small sets. Note that these trade-offs are essential 
to multi-objective optimization, and the choices in the majority of methods are natural. 
Nevertheless, directly optimizing the objective makes it difficult to study these structures 
as they vary in size from small to large because of these implicit or explicit biases. This
intermediate regime represents the meso-scale structure of the network. 

In this manuscript, we seek to study structures in this meso-scale regime by analyzing 
the behavior of seeded graph diffusions. Seeded graph diffusions model the behavior of a 
quantity of “dye” that is continuously injected at a small set of vertices called the seeds 
and distributed along the edges of the graph. These seeded diffusions can reveal multi-scale 
features of a graph through their dynamics. The class we study can be represented in 
terms of a column-stochastic distribution operator P:

    x = \sum_{k=0}^{\infty} \gamma_k P^k s,

where the γ_k are a set of diffusion coefficients that reflect the behavior of the dye k steps
away from the seed, and s is a sparse, stochastic vector representing the seed nodes. More
specifically, we study the PageRank diffusions

    x = \sum_{k=0}^{\infty} (1 - \alpha) \alpha^k P^k s.

The PageRank diffusion is equivalent to the stationary distribution of a random walk
that (i) with probability α, follows an edge in the graph and (ii) with probability (1 - α)
jumps back to a seed vertex (see Section 2.1 for more detail on this connection).
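
As an illustration only, the PageRank diffusion series can be approximated by truncating the sum after a fixed number of terms. The sketch below assumes an undirected graph stored as a dictionary of neighbor lists; the function name `pagerank_diffusion` and the parameter `num_terms` are hypothetical and not part of the method studied in this manuscript, which uses the push procedure instead.

```python
def pagerank_diffusion(adj, seeds, alpha=0.85, num_terms=200):
    """Truncated series x ~ sum_k (1 - alpha) * alpha**k * P**k * s.

    adj maps each node to a list of its neighbors (undirected graph with
    no isolated nodes). P = A D^{-1} is applied implicitly.
    """
    # s: uniform stochastic vector on the seed set.
    term = {v: 1.0 / len(seeds) for v in seeds}  # holds P^k s
    x = {}
    coeff = 1.0 - alpha  # holds (1 - alpha) * alpha^k
    for _ in range(num_terms):
        for v, mass in term.items():
            x[v] = x.get(v, 0.0) + coeff * mass
        # Apply P: each node splits its mass evenly among its neighbors.
        nxt = {}
        for v, mass in term.items():
            share = mass / len(adj[v])
            for u in adj[v]:
                nxt[u] = nxt.get(u, 0.0) + share
        term = nxt
        coeff *= alpha
    return x
```

The truncation error after K terms is at most α^K in the 1-norm, so this dense computation is useful mainly as a reference against which sparse approximations can be checked on small graphs.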

PageRank itself has been used for a broad range of applications including data mining,
machine learning, biology, chemistry, and neuroscience; see our recent survey [11]. Among
all the uses of PageRank, the seeded variation is frequently used to localize the PageRank
vector within a subset of the network; this is also known as personalized PageRank
due to its origins on the web, or localized PageRank because of its behavior. (We will
use the terms seeded PageRank, personalized PageRank, and localized PageRank
interchangeably and use the standard acronym PPR to refer to them.) Perhaps the most
important justification for this use is presented in the work of Andersen, Chung, and Lang,
where the authors determined a relationship between seeded PageRank vectors and
low-conductance sets that allowed them to create a type of graph partitioning method that
does not need to see the entire graph. Their PageRank-based clustering method, called the
push method, has been used for a number of important insights into communities in large
social and information networks [17, 21].


Our focus is a novel application of this push method for meso-scale structural analysis 
of networks. Push, which we'll describe formally in Section 3, depends on an accuracy
parameter ε. As we vary ε, the result of the push method for approximating the PageRank
diffusion reveals different structures of the network. We illustrate three PageRank vectors
as we vary ε for Newman's network science collaboration graph [25] in Figure 1. There, we
see that the solution vectors for PageRank that result from push have only a few non-zeros
for large values of ε. (Aside: There is a subtle inaccuracy in this statement. As we shall
see shortly, we actually are describing degree-normalized PageRank values. This difference
does not affect the non-zero components or the intuition behind the discussion.) This is
interesting because an accurate PageRank vector is mathematically non-zero everywhere
in the graph. Push, with large values of ε, then produces sparse approximations to the
PageRank vector. This connection is formal, and the parameter ε has a dual interpretation
as a sparsity regularization parameter [12] (reviewed in Section 3.2).

Figure 1. Nodes colored by their degree-normalized PageRank values as ε varies: dark
red is large, yellow is small. The hidden nodes are mathematically zero. As ε decreases,
more nodes become non-zero.

The solution path or regularization path for a parameter is the set of trajectories that
the components of the solution trace out as the parameter varies. We present new
algorithms based on the push procedure that allow us to approximate the solution path
trajectories as a function of ε. We use our solution path approximation to explore the
properties of graphs at many size-scales in Section 4. In our technical description, we
show that the solution path remains localized in the graph (Theorem 5.1). Experiments
show that it runs on real-world networks with millions of nodes in less than a second
(see our experimental results).

The push method has become a frequently used graph mining primitive because of the
sparsity of the vectors that result when push is used to approximate the seeded
PageRank diffusion, along with the speed at which they can be computed. The method
is typically used to identify sets of low conductance in a graph as part of a community
or cluster analysis [10, 13, 14, 17, 21, 28]. In these cases, the insights provided by the
solution paths are unlikely to be necessary. Rather, what is needed is a faster way to
compute these diffusions for many values of ε. We describe a data structure called a shelf
that we demonstrate can use 40 times as many values of ε in only 7 times the runtime
(Section 6.3).

We plan to make our computational codes available in the spirit of reproducible research. 


2 Technical Preliminaries 

We first fix our notation and review the Andersen-Chung-Lang procedure, which forms 
the basis for many of our contributions. We denote a graph by G = (V, E), where V is
the set of nodes and E the set of edges. All graphs we consider are simple, connected,
and undirected. Let G have n = |V| nodes and fix a labeling of the graph nodes using
the numbers 1, 2, ..., n. We refer to a node by its label. For each node j we denote its
degree by d_j.

The adjacency matrix of the graph G, which we denote by A, is the n x n matrix
having A_{i,j} = 1 if nodes i and j are connected by an edge, and 0 otherwise. Since G is
simple and undirected, A is symmetric with 0s on the diagonal. The matrix D denotes the
diagonal matrix with entry (i, i) equal to the degree of node i, d_i. Since G is connected,
D is invertible, and we can define the random walk transition matrix P := AD^{-1}.

We denote by e_j the standard basis vector of appropriate dimensions with a 1 in entry
j, and by e the vector of all 1s. In general, we use subscripts on matrices and vectors to
denote entries, e.g. A_{i,j} is entry (i, j) of matrix A; the notation for standard basis vectors,
e_j, is an exception. Superscripts refer to vectors in a sequence of vectors, e.g. x^{(k)} is the
kth vector in a sequence.

For any set of nodes S ⊆ V, we define the volume of S to be the sum of the degrees of
the nodes in S, denoted vol(S) = \sum_{j \in S} d_j. Next, define the boundary of S ⊆ V to be the
set of edges that have one endpoint inside S and the other endpoint outside S, denoted
∂(S). Finally, the conductance of S, denoted φ(S), is defined by

    \phi(S) := \frac{|\partial(S)|}{\min\{\mathrm{vol}(S), \mathrm{vol}(V - S)\}}.

Conductance can be thought of as measuring the extent to which a set is more connected
to itself than the rest of the graph and is one of the most commonly used community
detection objectives [26].
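
The definitions of volume, boundary, and conductance translate directly into a few lines of code. The following is a minimal sketch, assuming the graph is stored as a dictionary mapping each node to a list of neighbors; the function name `conductance` is illustrative.

```python
def conductance(adj, S):
    """Conductance of node set S in an undirected graph.

    adj maps each node to a list of its neighbors; the degree of a node
    is the length of that list.
    """
    S = set(S)
    vol_S = sum(len(adj[v]) for v in S)
    vol_total = sum(len(nbrs) for nbrs in adj.values())
    # Boundary size: edges with exactly one endpoint in S.
    cut = sum(1 for v in S for u in adj[v] if u not in S)
    denom = min(vol_S, vol_total - vol_S)
    if denom == 0:
        raise ValueError("conductance is undefined for empty or full sets")
    return cut / denom
```

For two triangles joined by a single edge, one triangle has volume 7 and a boundary of one edge, so its conductance is 1/7.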


2.1 PageRank and Andersen-Chung-Lang Method 

The Andersen-Chung-Lang method uses PageRank vectors to identify a set of small 
conductance focused around a small set of starting nodes. We call such starting nodes
seed sets and the resulting communities, local communities. We now briefly review this 
method starting with PageRank. 

For a stochastic matrix P, a stochastic vector v, and a parameter α ∈ (0, 1) we define
the PageRank diffusion as the solution x to the linear system

    (I - \alpha P) x = (1 - \alpha) v.        (2.1)

Note that when α ∈ (0, 1) the system in (2.1) can be solved via a Neumann series
expansion, and so the solution x to this linear system is equivalent to the PageRank
diffusion vector described in Section 1. When v = (1/|S|) e_S, i.e. the indicator vector for
a seed set S, normalized to be stochastic, then we say the PageRank vector has been
seeded on the set S (or personalized on the set S).
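
For completeness, the Neumann series step can be written out as follows. This is the standard argument, sketched under the stated assumptions that P is column-stochastic (so its induced 1-norm equals 1) and that 0 < α < 1.

```latex
\begin{align*}
\|\alpha P\|_1 = \alpha < 1
  \;&\Longrightarrow\; (I - \alpha P)^{-1} = \sum_{k=0}^{\infty} \alpha^k P^k, \\
x = (1 - \alpha)(I - \alpha P)^{-1} v
  \;&=\; \sum_{k=0}^{\infty} (1 - \alpha)\,\alpha^k P^k v .
\end{align*}
```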

Given PageRank diffusion scores x, the Andersen-Chung-Lang procedure uses the values
x_j/d_j to determine an order for a sweep-cut procedure (described below) that identifies a
set of good conductance. Thus, we would like to bound the error in approximating the
values x_j/d_j. Specifically (for their theory) we need our approximate solution x̂ to satisfy

    0 \le x_j - \hat{x}_j \le \varepsilon d_j, or equivalently, x \ge \hat{x} and \| D^{-1}(x - \hat{x}) \|_\infty \le \varepsilon.        (2.2)

Once a PPR diffusion x̂ is computed to this accuracy, a near-optimal conductance set
located near the seed nodes is generated by the following sweep-cut procedure. Rank
the nodes in descending order by their scaled diffusion scores x̂_j/d_j, with large scores
ranking the highest. Denote the set of nodes ranked 1 through m by S(m). Iteratively
compute the conductance of the sets S(m) for m = 2, 3, ..., until x̂_m/d_m = 0. Return
the set S(t) with the minimal conductance. This returned set is related to the optimal set
of minimum conductance near the seed set through a localized Cheeger inequality.
The value of ε relates to the possible size of the set.
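
The sweep-cut procedure can be sketched as below. This is an illustrative implementation, not the authors' code: it assumes a dictionary-of-neighbor-lists graph and a sparse dictionary of diffusion scores, and it evaluates every prefix of the ranking (including the single top-ranked node), updating the cut size and volume incrementally rather than recomputing them.

```python
def sweep_cut(adj, x):
    """Return (best_set, best_conductance) over prefixes of the ranking.

    adj maps each node to a list of neighbors; x maps nodes to diffusion
    scores (only positive entries are ranked).
    """
    vol_total = sum(len(nbrs) for nbrs in adj.values())
    # Rank nodes by degree-normalized score, largest first.
    order = sorted((v for v in x if x[v] > 0),
                   key=lambda v: x[v] / len(adj[v]), reverse=True)
    in_set = set()
    vol, cut = 0, 0
    best_phi, best_m = float("inf"), 0
    for m, v in enumerate(order, start=1):
        # Edges from v into the current set stop being cut edges;
        # all other edges of v become cut edges.
        inside = sum(1 for u in adj[v] if u in in_set)
        cut += len(adj[v]) - 2 * inside
        vol += len(adj[v])
        in_set.add(v)
        denom = min(vol, vol_total - vol)
        if denom == 0:
            break  # the prefix covers the whole graph
        phi = cut / denom
        if phi < best_phi:
            best_phi, best_m = phi, m
    return set(order[:best_m]), best_phi
```

Because the cut and volume are updated incrementally, the total work is proportional to the sum of the degrees of the ranked nodes plus the cost of sorting them.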

3 The push procedure


The push procedure is an iterative algorithm to compute a PageRank vector to satisfy 
the approximation (2.2). The distinguishing feature is that it can accomplish this goal 
with a sparse solution vector, which it can usually generate without ever looking at the 
entire graph or matrix. This procedure allows the Andersen-Chung-Lang procedure to 
run without ever looking at the entire graph. As we discussed in the introduction, this 
idea and method are at the heart of our contributions and so we present the method in 
some depth. 

At each step, push updates only a single coordinate of the approximate solution, like
a coordinate relaxation method. We'll describe its behavior in terms of a general linear
system of equations. Let Mx = b be a square linear system with 1s on the diagonal,
i.e. M_{i,i} = 1 for all i. Consider an iterative approximation x^{(k)} ≈ x after k steps. The
corresponding residual is r^{(k)} = b - Mx^{(k)}. Let j be a row index where we want to
relax, i.e. locally solve, the equation, and let r be the residual value there, r = r^{(k)}_j.
We update the solution by adding r to the corresponding entry of the solution vector,
x^{(k+1)} = x^{(k)} + r e_j, in order to guarantee r^{(k+1)}_j = 0. The residual can be efficiently
updated in this case. Thus, the push method involves the operations:

    x^{(k+1)} = x^{(k)} + r e_j
    r^{(k+1)} = r^{(k)} - r M e_j.        (3.1)

Note that the iteration requires updating just one entry of x^{(k)} and accessing only a
single column of the matrix M. It is this local update that enables push to solve the
seeded PageRank diffusion especially efficiently.
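
The residual update follows in one line from the definition of the residual, and the unit diagonal of M is what makes the relaxed entry vanish. The short derivation below also specializes the update to the PageRank system M = I - αP with P = AD^{-1}, which shows why only the neighbors of node j are touched.

```latex
\begin{align*}
r^{(k+1)} &= b - M x^{(k+1)} = b - M\bigl(x^{(k)} + r\,e_j\bigr) = r^{(k)} - r\,M e_j, \\
r^{(k+1)}_j &= r^{(k)}_j - r\,M_{j,j} = r - r \cdot 1 = 0, \\
M = I - \alpha A D^{-1} \;&\Longrightarrow\;
r^{(k+1)} = r^{(k)} - r\,e_j + \frac{\alpha r}{d_j}\,A e_j .
\end{align*}
```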


3.1 The Andersen-Chung-Lang Push Procedure for PageRank 


The full algorithm for the push method applied to the PageRank linear system to compute
a solution that satisfies (2.2) for a seed set S is:

1. Initialize x = 0, r = (1 - α) e_S using sparse data structures such as a hash table.
2. Add any coordinate i of r where r_i > ε d_i to a queue Q.
3. While Q is not empty
4.     Let j be the coordinate at the front of the queue and pop this element.
5.     Set x_j = x_j + r_j
6.     Set δ = α r_j / d_j
7.     Set r_j = 0
8.     For all neighbors u of node j
9.         Set r_u = r_u + δ
10.        If r_u exceeds ε d_u after this change, add u to Q.

The queue maintains a list of all coordinates (or nodes) where the residual is larger than
ε d_j. We choose coordinates to relax from this queue. Then we execute the push procedure
to update the solution and residual. The residual update operates on only the nodes that
neighbor the updated coordinate j. Once elements in the residual exceed the threshold,
they are entered into the queue. We present the convergence theory for this method in
the description of our new algorithms (Section 5).
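
The push procedure for PageRank can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a dictionary-of-neighbor-lists graph, uses Python dictionaries as the sparse data structures, and normalizes the initial residual by the number of seeds so that the seed vector is stochastic as in (2.1). The name `ppr_push` and the auxiliary `queued` set (used to avoid duplicate queue entries) are our own choices.

```python
from collections import deque


def ppr_push(adj, seeds, alpha=0.85, eps=1e-4):
    """Approximate seeded PageRank by the push procedure.

    Returns (x, r): the sparse solution and the final residual, where
    every residual entry satisfies r[v] <= eps * degree(v).
    """
    x = {}
    r = {s: (1.0 - alpha) / len(seeds) for s in seeds}
    queue = deque(s for s in seeds if r[s] > eps * len(adj[s]))
    queued = set(queue)  # nodes currently waiting in the queue
    while queue:
        j = queue.popleft()
        queued.discard(j)
        rj = r[j]
        x[j] = x.get(j, 0.0) + rj         # concentrate the dye at node j
        r[j] = 0.0
        delta = alpha * rj / len(adj[j])  # dye pushed to each neighbor
        for u in adj[j]:
            r[u] = r.get(u, 0.0) + delta
            if r[u] > eps * len(adj[u]) and u not in queued:
                queue.append(u)
                queued.add(u)
    return x, r
```

Each push moves at least ε d_j of mass into the solution, and the solution never holds more than total mass 1, so the loop terminates after a number of pushes that does not depend on the size of the graph.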

We have presented the push method so far from a linear solver perspective. To instead
view the method from a graph diffusion perspective, think of the solution vector as
tracking where “dye” has concentrated in the graph and the residual as tracking where
“dye” is still spreading. At each step of the method, we find a node with a sufficiently
large amount of dye left (Step 4), concentrate it at that node (Step 5), then update the
amount of dye that is left in the system as a result of concentrating this quantity of dye
(Steps 6-10). The name push comes from the pattern of concentrating dye and pushing
newly unprocessed dye to the adjacent residual entries.

Note that the value of ε plays a critical role in this method as it determines the entries
that enter the queue. When ε is large, only a small number of coordinates or nodes
will ever enter the queue. This will result in a sparse solution. As ε → 0, there will be
substantially more entries that enter the queue.


3.2 Implicit regularization from Push 


To understand the sparsity that results from the push method, we introduce a slight
variation on the standard push procedure. Rather than using the full update x_j + r_j
and pushing α r_j / d_j to the adjacent residuals, we consider a method that takes a partial
update. The form we assume is that we will leave ε ρ d_j “dye” remaining at node j. For
ρ = 0, this corresponds to the push procedure described above. For ρ = 1, this update will
remove node j from the queue, but push as little mass as possible to the adjacent nodes
such that the dye at node j will remain below ε d_j. The change is just at steps 5-7:

5'. Set x_j = x_j + (r_j - ε ρ d_j)
6'. Set δ = α (r_j - ε ρ d_j) / d_j
7'. Set r_j = ε ρ d_j
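
The three modified steps can be isolated in a small helper, shown here as a sketch. The function name `partial_push_step` and its return convention are hypothetical; the arithmetic is exactly that of steps 5', 6', and 7'.

```python
def partial_push_step(r_j, d_j, alpha, eps, rho):
    """One partial push at node j.

    Returns (x_increment, delta, new_r_j), where x_increment is added to
    x_j, delta is added to the residual of every neighbor of j, and
    new_r_j is the dye left behind at node j.
    """
    keep = eps * rho * d_j          # step 7': dye that remains at node j
    moved = r_j - keep              # step 5': dye concentrated into x_j
    delta = alpha * moved / d_j     # step 6': dye pushed to each neighbor
    return moved, delta, keep
```

With rho = 0 the helper reproduces the standard push, and because a node is only pushed when r_j > ε d_j, every increment to x_j exceeds (1 - ρ) ε d_j.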


In previous work [12, Theorem 3], we showed that ρ = 1 produces a solution vector x
that exactly solves a related 1-norm regularized optimization problem. The form of the
problem that x solves is most cleanly stated as a quadratic optimization problem in z, a
degree-based rescaling of the solution variable x:

    minimize    \frac{1}{2} z^T Q z - z^T g + C \varepsilon \| D z \|_1
    subject to  z \ge 0.        (3.2)

The details of the normalization of x vs. z and the definitions of Q, g, C are tedious to state
exactly and uninformative for our purposes in this work. The important point is that ε
can also be interpreted as a regularization parameter that governs the sparsity of the
solution vector x. Large values of ε increase the magnitude of the 1-norm regularizer and
thus cause the solutions to be sparser. Moreover, the resulting solutions are unique as the
above problem is strongly convex.

In this work, we seek algorithms to compute the solution paths or regularization paths
that result from using all values of ε, in order to fully study the behavior of the diffusion.
In the next section we explore some potential uses of these paths before presenting
our algorithms for computing them in Section 5.

4 Personalized PageRank paths

In this section we aim to show the types of insights that our solution path methodology
can provide. We should remark that these are primarily designed for human interpretation.
Our vision is that they would be used by an analyst who is studying a network and
needs to better understand the “region” around a target node. These solution paths
would then be combined with something like a graph layout framework to study these
patterns in the graph. Thus, much of the analysis here will be qualitative. We demonstrate
quantitative advantages of the path methodology in subsequent sections.


4.1 Exact paths and fast path approximations 


The exact solution path for the seeded PageRank diffusion results from solving the
regularized optimization problem (3.2) itself for all values of ε. This could be accomplished
by using ideas similar to those used to compute solution paths for the Lasso regularizer.
Our algorithms and subsequent analysis evaluate approximate solution paths that result
from using our push-based algorithm with ρ = 0.9 (Section 5.2). In this section, we
compare these approximate paths to the exact paths. We find that, while the precise
numbers change, the qualitative properties are no different.

Figure 2 shows the results of such a comparison on Newman's netscience dataset (379
nodes, 914 edges). Each curve or line in the plot represents the value of a non-zero
entry of an approximate PageRank vector x̂ as ε varies (horizontal axis). As ε approaches
0 (and 1/ε approaches ∞), each approximate PageRank entry approaches its exact value
in a monotonic manner. Alternatively, we can think of each line as the diffusion value of
a node as the diffusion process spreads across the graph.


One of the plots was computed by solving for the optimality conditions of (3.2); the
other plot was computed using the PPR path algorithm from Section 5.2. The values of ε
are automatically determined by the algorithm itself. The plots show that the two sets
of paths have essentially identical qualitative features. For example, they reveal the same
bends and inflections in individual node trajectories, as well as large gaps in PageRank
values. The maximum difference between the two paths never exceeds 1.1 · 10“^.

These results were essentially unchanged for a variety of other sample diffusions we
considered, and so we decided that using ρ = 0.9 was an acceptable compromise between
speed and exactness. Thus, all path plots in this paper were created with ρ = 0.9, unless
noted otherwise. (For analysis of the differences between the exact paths and ρ-paths, and in
particular the behavior of the ρ-approximate paths as ρ varies, see the figure on this
comparison below.)


4.2 The Seeded PageRank Solution Path Plot 

We now wish to introduce a specific variation on the solution path plot that shows
helpful contextual information. In the course of computation, our solution path algorithm
identifies a small set of values of ε (somewhere between a few hundred and a few thousand)
where it satisfies the solution criteria (5.2). At these values, we perform a sweep-cut
procedure to identify the set of best conductance induced by the current solution. In the
solution path plot, we display the cut-point identified by this procedure as a thick black
line. All the nodes whose trajectories are above the thick black line at a particular value
of ε are contained in the set of best conductance at that value of ε. This line allows us to
follow the trajectory of the minimum conductance set as we vary ε. Another property
of our algorithm is that the smallest possible non-zero diffusion value in the solution is
(1 - ρ)ε. Thus, we plot this as a thin, diagonal, black line that acts as a pseudo-origin for
all of the node trajectories. The vertical blue lines in the bottom left of the plot mark
the values of ε where we detect a significant new set of best conductance. Representative
conductance values are shown when there is room in the plot.

Figure 2. (Left) The solution paths for a PageRank diffusion on Newman's netscience
dataset from a single seed node computed by exactly solving the regularized problem.
(Right) The approximate solution paths computed by our push-based solution path
algorithm with ρ = 0.9. Each line traces a value x_j as ε varies. The maximum
infinity-norm distance between the two paths is 1.1 · 10“^, showing that ρ = 0.9 provides a
good qualitative approximation. Moreover, the two plots highlight identical qualitative
features: for example, the large gaps between paths, and the strange bend in the paths
near ε = 10“^. The coloring of the lines is based on the values at the smallest value of ε.
The values of ε used were generated by the approximate algorithm itself and we computed
the exact solution at these same values for comparison.

The solution path plot that corresponds to Figure 2 is shown in Figure 3. This plot
illustrates all of the features we discussed in this section.

4.3 Nested communities in netscience and Facebook 

We now discuss some of the insights that arise from the solution path plot. In Figure 3
we show the seeded PageRank solution path plot for around 21,000 values of ε computed
via our algorithm for the network science collaboration network. This computation runs
in less than a second. Here, we see that large gaps in the degree-normalized PageRank
vector indicate cutoffs for sets of good conductance. This behavior is known to occur
when sets of really good conductance emerge. We can now see how they evolve and
how the procedure quickly jumps between them. In particular, the path plots reveal
multiple communities (good conductance sets) nested within one another through the
gaps between the trajectories.

Figure 3. An example of the seeded PageRank solution path plot on Newman's netscience
dataset. Each colored line represents the value of a single node as the diffusion progresses
from large ε to small ε. Because of our ρ-approximation to the true paths, the smallest
value any node obtains is (1 - ρ)ε and we plot this as a dark diagonal line. The thick
black line traces out the boundary of the set of best conductance found at each distinct
value of ε as determined by a sweep-cut procedure. The blue lines indicate significant
changes to the set of minimum conductance, and they are labelled with the conductance
value. The coloring of the trajectory lines is based on the values at the smallest value of
ε. We discuss implications of the plot in Section 4.3.


On a crawl of a Facebook network from 2009 where edges between nodes correspond
to observed interactions [29] (see the fb-one entry of the dataset table for the statistics),
we are able to find a large, low conductance set using our solution path method. (Again,
this takes about a second of computation.) Pictured in Figure 4, this diffusion shows no
sharp drops in the PageRank values like in the network science data, yet we still find good
conductance cuts. Note the few stray “orange” nodes in the sea of yellow. These nodes
quickly grow in PageRank and break into the set of smallest conductance. Finding these
nodes is likely to be important to understand the boundaries of communities in social
networks; these trajectories could also indicate anomalous nodes. Furthermore, this example
also shows evidence of multiple nested communities. These are illustrated with the manual
annotations A, B, C.


4.4 Core and periphery structure in the US Senate 


The authors in [17] analyzed voting patterns across the first 110 US Senates by comparing
senators in particular terms. We form a graph from this US Senate data where each
senator is represented by a single node. For each term of the Senate, we connect senators
in that session to their 3 nearest neighbors measured by voting similarities. This graph
has a substantial temporal structure as a senator from 100 years ago cannot have any
direct links to a senator serving 10 years ago. We show how our solution paths display
markedly different characteristics when seeded on a node near the core of the network
compared with a node near the periphery. This example is especially interesting because
both diffusions lead to closely related cuts.

Figure 4. The seeded PageRank solution path for a crawl of observed Facebook network
activity for one year (fb-one in the dataset table) shows large, good cuts do not need to have
large drops in the PageRank values. Nodes enter the solution and then quickly break
into the best conductance set, showing that the frontier of the diffusion should be an
interesting set in this graph. Furthermore, this path plot shows evidence of multiple
nested communities (A, B, and C), which were manually annotated. The set A is only
a few nodes, but has a small conductance score of 0.11; set B grows and improves this
to a conductance of 0.1, and finally set C achieves a conductance of 0.07, which is an
unusually small conductance value in a large social network.

Figure 5 displays solution paths seeded on a senator on the periphery of the network
(top right) and a senator connected to the core of the network (top left). Here are some 
qualitative insights from the solution path plots. The peripheral seed is a senator who 
served a single term; the diffusion spreads across the graph slowly because the seed is 
poorly connected to the network outside the seed senator’s own senate term. As the 
diffusion spreads outside the seed’s particular term, the paths identify multiple nested 
communities that essentially reflect previous and successive terms of the Senate. In 
contrast, the core node is a senator who served eight terms. The core node’s paths skip 
over such smaller-scale community structures (i.e. individual senate terms) as the diffusion 
spreads to each of those terms nearly simultaneously. Instead, the paths of the core node 
identify only one good cut: the cut separating all of the seed’s terms from the remainder 
of the network. 

This example demonstrates the paths' potential ability to shed light on a seed's
relationship to the network's core and periphery, as well as the seed's relationship to
many communities.

Figure 5. (Top.) The solution paths on the US Senate graph for a senator in the core
(who served multiple terms and is centrally located in a graph layout) and for a senator
in the periphery (who served a single term and is located on the boundary of the graph
layout). (Bottom.) The diffusions for each of these senators are shown as heat-plots on the
graph layout. Red indicates nodes with the largest values and yellow the smallest. The
seed nodes are circled in these layouts. The solution paths for a peripheral node indicate
multiple nested communities, visible in the images of the diffusion on the whole graph
and marked A, B, C, D, E. These sets are strongly correlated with successive terms of the
Senate. In contrast, the core node diffusion only indicates one good cut. For the core node,
we can see the diffusion essentially spreads across multiple dense regions simultaneously,
without settling in one easily separated region until ε is small enough that the diffusion
has spread to the entire left side of the graph. The sets A and F are also almost the same.


4.5 Cluster boundaries in handwritten digit graphs 


Finally, we use the solution paths to study the behavior of a diffusion for a semi-supervised
learning task. The USPS handwritten digits dataset consists of roughly 10,000 images of
the digits 0 through 9 in human handwriting [32]. Each digit appears in roughly 1,000
of the images, and each image is labelled accordingly. From this data we construct a
3-nearest-neighbors graph, and carry out our analysis as follows. Pick one digit, and select
4 seed nodes uniformly at random from the set of nodes labelled with this digit. Then
compute the PageRank solution paths from these seeds. Figure 6 shows the path plots
with labels (right) and without (left). In the labelled plot, the correct labels are red and
the incorrect labels are green.
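
To make the graph construction concrete, the following sketch builds a k-nearest-neighbors graph from a list of feature vectors. It is an illustration under assumptions that the text does not specify: distances are Euclidean, and an undirected edge is added whenever either endpoint lists the other among its k nearest points. The function name `knn_graph` is hypothetical, and the quadratic-time search is only suitable for small examples.

```python
import math


def knn_graph(points, k=3):
    """Undirected k-nearest-neighbors graph as a dict of neighbor lists.

    points is a list of equal-length coordinate tuples; node i is the
    i-th point. Ties in distance are broken by smaller index.
    """
    n = len(points)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            # Symmetrize: keep the edge if either endpoint selects it.
            adj[i].add(j)
            adj[j].add(i)
    return {i: sorted(nbrs) for i, nbrs in adj.items()}
```

Because of the symmetrization, some nodes end up with degree larger than k, which is one reason degree normalization matters when ranking diffusion values on such graphs.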

We can use the best conductance set determined by the PPR vector to capture a
number of other nodes sharing the seeds' label. However, this straightforward usage of
a PageRank vector results in a number of false positives. Figure 6 (right) shows that a
number of nodes with incorrect labels are included in the set of best conductance (curves
that are not colored red do not share the seeds' label).

Looking at the solution paths for this PageRank vector (Figure 6, left) we can see that
a number of these false positives can be identified as the erratic lighter-orange paths
cutting across the red paths. Furthermore, the solution paths display earlier sets of best
conductance (left of the black spikes near ε = 10“^) that would cut out almost all false
positives. This demonstrates that the solution paths can be used to identify “stable” sets
of best conductance that are likely to yield higher precision labeling results. Consequently,
these results hint that a smaller, but more precise, set lurks inside of the set of best
conductance. This information would be valuable when determining additional labels or
trying to study new data that is not as well characterized as the USPS digits dataset.


4.6 Discussion 

Overall, these seeded PageRank solution path plots reveal information about the clusters 
and sets near the seeds. Some of the features we’ve seen include nested community structure 
and core-periphery structure. They all provide refined information about the boundary of a 
community containing the seed, and suggest nodes with seemingly anomalous connections 
to the seed. For instance, some nodes enter the diffusion early but have only a slow-growing 
value indicating a weak connection to the seed; other nodes are delayed in entering the 
diffusion but quickly grow in magnitude and end up being significant members of the 
cluster. Each of these features offers refined insights over the standard single-shot diffusion 
computation. 


5 Algorithms 

Here we present two novel algorithms for analyzing a PPR diffusion across a variety
of accuracy parameter settings by computing the diffusion only a single time. Our first
algorithm (Section 5.2) computes the best-conductance set from the ρ-approximate
solution paths described above. This effectively finds the best-conductance set
from PPR diffusions for every accuracy parameter in an interval [ε_min, ε_max], where ε_min
and ε_max are inputs. We prove that the total runtime is bounded in terms of ε_min^{-2} and
inverse powers of (1 - α) and (1 - ρ), though we believe improvements can be made to this
bound. In addition to identifying the best-conductance set taken from the different
approximations, the algorithm enables us to study the solution paths of PageRank, i.e. how
the PPR diffusion scores change as the diffusion's accuracy varies. Hence, we call this
method ppr-path.

Figure 6. Seeded PageRank solution path plots for diffusions in the USPS digit dataset.
The seeds are chosen to be images of handwritten digits with the same label. (At left.)
The solution paths reveal a number of anomalous node trajectories near the set of best
conductance. Nodes entering the set of best conductance after the black line erratically
oscillates are most likely to be false positives near the boundary. (At right.) Here, we
have colored the solution path lines based on the true-class label. Red shows a correct
label and green shows an incorrect label.

We describe a second algorithm optimized for speed (Section 5.3) in finding sets 
of low conductance, as the exhaustive nature of our first method generates too much 
intermediate data for stricter values of e. Instead of computing the full solution paths, the 
second method searches for good-conductance sets over an approximate solution for each 
accuracy parameter taken from a grid of parameter values. The spacing of the accuracy 
parameters values on the grid is an additional input parameter. For this reason, we call 
the algorithm ppr-grid. For a log-spaced grid of values $\varepsilon_0 > \varepsilon_1 > \cdots > \varepsilon_N$, we locate the 
best-conductance set taken from a sweep over each $\varepsilon_k$-approximation. The work required 
to compute the diffusions is bounded by $O(\varepsilon_N^{-1}(1-\alpha)^{-1})$; we show this yields a constant 
factor speedup over the practice of computing each diffusion separately. However, our 
method requires the same amount of work for performing the sweeps over each different 
diffusion. 

We begin by describing a modification to the PageRank linear system that will simplify 
our notation and the exposition of our algorithm. 
5.1 A modified PageRank linear system for the push procedure

Recall that the goal is to solve the PageRank linear system (2.1) to the accuracy condition (2.2) and then sort by the elements $x_j/d_j$. If we multiply Equation (2.1) by $D^{-1}$, then after some manipulation we obtain

$$(I - \alpha P^T) D^{-1} x = (1-\alpha) D^{-1} v.$$

Note this transformation relies on $A$ being symmetric so that $P^T = (AD^{-1})^T = D^{-1}A = D^{-1}PD$. To avoid writing $D^{-1}$ repeatedly, we make the change of variables $y = (1/(1-\alpha))D^{-1}x$ and $b = D^{-1}v$. The modified system is then

$$(I - \alpha P^T) y = b \tag{5.1}$$

and we set $x^{(k)} = (1-\alpha) D y^{(k)}$.
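The manipulation is short enough to write out. The following is a sketch that assumes (2.1) has the form $(I - \alpha P)x = (1-\alpha)v$ with $P = AD^{-1}$, as it is restated in Theorem 5.1:

```latex
\begin{align*}
D^{-1}(I - \alpha P)\,x &= (1-\alpha)\,D^{-1}v \\
D^{-1}x - \alpha\,D^{-1}A\,D^{-1}x &= (1-\alpha)\,D^{-1}v \\
(I - \alpha P^T)\,D^{-1}x &= (1-\alpha)\,D^{-1}v,
  \qquad \text{since } D^{-1}A = P^T \text{ for symmetric } A .
\end{align*}
```

Dividing both sides by $(1-\alpha)$ and substituting $y = (1/(1-\alpha))D^{-1}x$ and $b = D^{-1}v$ then gives the modified system $(I - \alpha P^T)y = b$.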

This connection between $x$ and $y$ enables us to establish a convergence criterion for our algorithms that will guarantee we obtain an approximation with the kind of accuracy typically desired for methods related to the push operation, e.g. (2.2). More concretely, since $D^{-1}(x - \hat{x}) = (1-\alpha)(y - \hat{y})$, to guarantee $\|D^{-1}(x - \hat{x})\|_\infty < \varepsilon$ it suffices to guarantee $\|y - \hat{y}\|_\infty < \varepsilon/(1-\alpha)$, so it suffices for our purposes to bound the error of the system (5.1).

The accuracy requirement has two components: nonnegativity, and error. We relate the solution to its residual as the first step toward proving both of these. Left-multiplying the residual vector for (5.1), $r^{(k)} = b - (I - \alpha P^T)y^{(k)}$, by $(I - \alpha P^T)^{-1}$ and substituting $y = (I - \alpha P^T)^{-1}b$, we get

$$y - y^{(k)} = \left( \sum_{m=0}^{\infty} \alpha^m (P^T)^m \right) r^{(k)},$$

where the right-hand side replaces $(I - \alpha P^T)^{-1}$ with its Neumann series. Note here that, if the right-hand side consists of all nonnegative entries, then it is guaranteed that $y - y^{(k)} \geq 0$ holds. Recall from Section 3.1 that the residual update involved in the push procedure consists of adding nonnegative components to the residual, and so the residual must be nonnegative. Then, since $(1-\alpha)y^{(k)} = D^{-1}x^{(k)}$, this implies $x \geq x^{(k)}$, proving one component of the accuracy criteria (2.2) is satisfied.

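Next we bound the error in $y$ in terms of its residual, and then control the residual's norm. Using the triangle inequality and sub-multiplicativity of the infinity norm allows us to bound $\|y - y^{(k)}\|_\infty$, which implies (2.2), with the following:

$$\|y - y^{(k)}\|_\infty \leq \sum_{m=0}^{\infty} \alpha^m \|(P^T)^m\|_\infty \, \|r^{(k)}\|_\infty \leq \left( \sum_{m=0}^{\infty} \alpha^m \|P^T\|_\infty^m \right) \|r^{(k)}\|_\infty.$$

Finally, since $P$ is column stochastic, $P^T$ is row-stochastic, and so $\|P^T\|_\infty = 1$. Substituting this and noting that $\sum_{m=0}^{\infty} \alpha^m = 1/(1-\alpha)$ allows us to bound

$$\|y - y^{(k)}\|_\infty \leq \frac{1}{1-\alpha} \|r^{(k)}\|_\infty.$$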

So to guarantee $x$ satisfies the desired accuracy, it is enough to guarantee that

$$\|r^{(k)}\|_\infty < \varepsilon \tag{5.2}$$

holds, where $r^{(k)} = b - (I - \alpha P^T)y^{(k)}$ and $x^{(k)} = (1-\alpha)Dy^{(k)}$. Thus, for our algorithms to converge to the desired accuracy, it suffices to iterate until the residual norm satisfies the bound (5.2). With this terminating condition established, we can now describe our algorithm for computing the solution paths of $x_\varepsilon$ as $\varepsilon$ varies.


5.2 PageRank solution paths 


Recall that our goal is computing the solution paths of seeded PageRank with respect to the parameter $\varepsilon$. That is, we want an approximation $x_\varepsilon$ of PageRank for all $\varepsilon$ values inside some region. Let $P$ be a stochastic matrix, choose $\alpha$ satisfying $0 < \alpha < 1$, let $v$ be a stochastic vector, and set $b = D^{-1}v$. Fix input parameters $\varepsilon_{\min}$ and $\varepsilon_{\max}$. Then for each value $\varepsilon_{\text{cur}} \in [\varepsilon_{\min}, \varepsilon_{\max}]$ ($\varepsilon_{\text{cur}}$ denotes "the value of $\varepsilon$ currently being considered"), we want an approximation $\hat{y}$ of the solution to $(I - \alpha P^T)y = b$ that satisfies $\|y - \hat{y}\|_\infty < \varepsilon_{\text{cur}}/(1-\alpha)$. (Or rather, we want a computable approximation to this information.) As discussed in Section 3.2, we also use the approximation parameter $\rho \in [0,1)$ in the push step.

Given initial solution $y^{(0)} = 0$ and residual $r^{(0)} = b$, proceed as follows. Maintain a priority queue, $Q(r)$, of all entries of the residual that do not satisfy the convergence criterion $r_j < \varepsilon_{\min}$. We store the entries of $Q(r)$ using a max-heap so that we can quickly determine $\|r\|_\infty$ at every step.

Each time the value $\|r\|_\infty$ reaches a new minimum, we consider the resulting solution vector to satisfy a new "current" accuracy, which we denote $\varepsilon_{\text{cur}}$. For each such $\varepsilon_{\text{cur}}$ achieved, we want to perform a sweep over the solution vector. Because the sweep operation requires a sorted solution vector, we keep $y$ in a sorted array, $L(y)$. By re-sorting the solution vector each time a single entry $y_j$ is updated, we avoid having to do a full sweep for each "new" $\varepsilon_{\text{cur}}$-approximation. The local sorting operation is a bubblesort on a single entry; the local sweep update we describe below.

With the residual and solution vector organized in this way, we can quickly perform each step of the above iterative update. Then, iterating until $\|r\|_\infty < \varepsilon_{\min}$ guarantees convergence to the desired accuracy. Next we present the iteration in full detail.


PPR path algorithm

The ppr-path algorithm performs the following iteration until the maximum entry in $Q(r)$ is below the smallest parameter desired, $\varepsilon_{\min}$.

1. Pop the max of $Q(r)$, say entry $j$ with value $r$, then set $r_j = \rho\varepsilon_{\text{cur}}$ and re-heap $Q(r)$.

2. Add $r - \rho\varepsilon_{\text{cur}}$ to $y_j$.

3. Bubblesort entry $y_j$ in $L(y)$.

4. If $L(y)$ changes, perform a local sweep update.

5. Add $(r - \rho\varepsilon_{\text{cur}})\,\alpha P^T e_j$ to the residual vector.

6. For each entry $i$ of the residual that was updated, if it does not satisfy $r_i < \varepsilon_{\min}$, then insert (or update) that entry in $Q(r)$ and re-heap.

7. If $\|r\|_\infty < \varepsilon_{\text{cur}}$, record the sweep information, then set $\varepsilon_{\text{cur}} = \|r\|_\infty$.

When the max-heap $Q(r)$ is empty, this signals that all entries of the residual satisfy the convergence criterion $r_j < \varepsilon_{\min}$, and so our diffusion score approximations satisfy the accuracy requirement (2.2).
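The push loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is an undirected adjacency dictionary, the seed vector is a single node, and the names `ppr_path_push`, `adj`, and `eps_history` are hypothetical. It omits the sorted array $L(y)$ and the sweep update (steps 3 and 4), replaces the "insert (or update)" heap operation with lazy deletion of stale heap entries, and records only the sequence of $\varepsilon_{\text{cur}}$ values.

```python
import heapq


def ppr_path_push(adj, seed, alpha=0.85, rho=0.5, eps_min=1e-3):
    """Push iteration on (I - alpha P^T) y = b with b = D^{-1} e_seed.

    adj maps each node to a list of neighbours (undirected, no self-loops).
    Returns the sparse solution y, the residual r, and the eps_cur values seen.
    """
    deg = {u: len(nbrs) for u, nbrs in adj.items()}
    y = {}
    r = {seed: 1.0 / deg[seed]}
    heap = [(-r[seed], seed)]        # max-heap via negated values
    eps_cur = r[seed]
    eps_history = [eps_cur]

    def current_max():
        # Drop stale heap entries, then report ||r||_inf.
        while heap and -heap[0][0] != r.get(heap[0][1], 0.0):
            heapq.heappop(heap)
        return -heap[0][0] if heap else max(r.values(), default=0.0)

    while current_max() >= eps_min:
        neg_val, j = heapq.heappop(heap)            # step 1: pop the max
        amount = -neg_val - rho * eps_cur
        r[j] = rho * eps_cur                        # leave rho * eps_cur behind
        if r[j] >= eps_min:
            heapq.heappush(heap, (-r[j], j))
        y[j] = y.get(j, 0.0) + amount               # step 2
        for i in adj[j]:                            # step 5: entries of P^T e_j are 1/d_i
            r[i] = r.get(i, 0.0) + alpha * amount / deg[i]
            if r[i] >= eps_min:                     # step 6
                heapq.heappush(heap, (-r[i], i))
        rmax = current_max()
        if rmax < eps_cur:                          # step 7: new "current" accuracy
            eps_cur = rmax
            eps_history.append(eps_cur)
    return y, r, eps_history
```

Each value appended to `eps_history` marks a point at which the full algorithm would record sweep information. Lazy deletion keeps the sketch short at the cost of a larger heap; an indexed heap would match the update operation described in step 6 more closely.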
Sweep update

The standard sweep operation over a solution vector involves sorting the entire solution vector and iteratively computing the conductance of each consecutive sweep set. Here, we re-sort the solution vector after each update by making only the local changes necessary to move entry $y_j$ to the correct ranking in $L(y)$. This is accomplished by bubblesorting the updated entry $y_j$ up the rankings in $L(y)$. Note that if $y^{(k)}$ has $T_k$ nonzero entries, then this step can take at most $T_k$ operations. We believe this loose upperbound can be improved. We could determine the new rank of node $y_j$ in work $\log T_k$ via a binary insert. However, since we must update the rank and sweep information of each node that node $y_j$ surpasses, the asymptotic complexity would not change.

Once the node ranks have been corrected, the conductance score update proceeds as follows. Denote by $S^{(k-1)}(m)$ the set of nodes that have rankings $1, 2, \ldots, m$ during step $k-1$. Assuming we have the cut-set (cut and volume) information for each of these sets, then we can update that information for the sets $S^{(k)}(m)$ as follows.

Suppose the node that changed rankings was promoted from rank $j$ to rank $j - \Delta_k$. Observe that the sets $S^{(k)}(m)$ and their cut-set information remain the same for any set $S^{(k)}(m)$ lying inside the rankings $[1, \ldots, j - \Delta_k - 1]$, because the change in rankings happened entirely in the interval $[j - \Delta_k, \ldots, j]$. This occurs for $m < j - \Delta_k$. Similarly, any set $S^{(k)}(m)$ with $m > j$ would already contain all of the nodes whose rank changed: altering the ordering within the set does not alter the conductance of that set, and so this cut-set information also need not be changed. Hence, we need to update the cut-set information for only the intermediate sets.

Now we update the cut-set information for those intermediate sets. We refer to the node that changed rank as node $L(j)$. Its old rank was $j$, and its new rank is $j - \Delta_k$. Note that the cut-set information for the set $S^{(k)}(j - t)$ (for $t = 0, \ldots, \Delta_k$) is the exact same as that of set $S^{(k-1)}(j - t - 1) \cup \{L(j)\}$. In words, we introduce the node $L(j)$ to the set $S^{(k-1)}(j - t - 1)$ from the previous iteration, and then compute the cut-set information for the new iteration's set, $S^{(k)}(j - t)$, by looking at just the neighborhood of node $L(j)$ a single time. This provides a great savings over simply reperforming the sweep procedure over the entire solution vector up to the index where the rankings changed.

If the node being operated on, $L(j)$, has degree $d$, then this process requires work $O(d + \Delta_k)$. As discussed above, we can upperbound $\Delta_k$ with the total number of iterations the algorithm performs, $T_k$.
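The local update can be illustrated with a short Python sketch. It assumes an unweighted, undirected graph stored as an adjacency dictionary and uses 0-based positions; the helper names `sweep_stats` and `promote` are hypothetical and not taken from the paper. `sweep_stats` is the full recomputation used as a reference, while `promote` touches only the prefix sets between the new and old position of the moved node and scans that node's neighborhood once.

```python
def sweep_stats(adj, order):
    """Cut and volume of every prefix set order[:m], for m = 0..len(order)."""
    in_set, cut, vol = set(), [0], [0]
    for u in order:
        inside = sum(1 for w in adj[u] if w in in_set)
        cut.append(cut[-1] + len(adj[u]) - 2 * inside)
        vol.append(vol[-1] + len(adj[u]))
        in_set.add(u)
    return cut, vol


def promote(adj, order, pos, cut, vol, j, delta):
    """Move the node at index j up to index j - delta, updating everything in place.

    Work is O(d + delta), where d is the degree of the moved node.
    """
    u, p, d = order[j], j - delta, len(adj[order[j]])
    # One pass over the neighbourhood of u, using the OLD positions.
    before = 0                      # neighbours ranked ahead of the new position
    mark = [0] * delta              # neighbours at old positions p .. j-1
    for w in adj[u]:
        q = pos.get(w)
        if q is None:
            continue                # neighbour not yet in the sorted solution
        if q < p:
            before += 1
        elif q < j:
            mark[q - p] = 1
    # Shift the overtaken nodes down one rank and place u.
    for q in range(j, p, -1):
        order[q] = order[q - 1]
        pos[order[q]] = q
    order[p] = u
    pos[u] = p
    # New prefix of size m (p < m <= j) equals the old prefix of size m-1 plus u.
    new_cut, new_vol, inside = [], [], before
    for m in range(p + 1, j + 1):
        new_cut.append(cut[m - 1] + d - 2 * inside)
        new_vol.append(vol[m - 1] + d)
        inside += mark[m - 1 - p]   # old position m-1 joins the next prefix
    cut[p + 1:j + 1] = new_cut
    vol[p + 1:j + 1] = new_vol
```

The conductance of a prefix set then follows from `cut[m]` and `vol[m]` in the usual way, so only the entries in the rewritten slice need to be re-examined when searching for the best sweep cut.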


Theorem 5.1 Given a random walk transition matrix $P = AD^{-1}$, stochastic vector $v$, and input parameters $\alpha \in (0,1)$, $\rho \in [0,1)$, and $\varepsilon_{\max} > \varepsilon_{\min} > 0$, our ppr-path algorithm outputs the best-conductance set found from sweeps over $\varepsilon_{\text{cur}}$-accurate degree-normalized, $\rho$-approximate solution vectors $x$ to $(I - \alpha P)x = (1-\alpha)v$, for all values $\varepsilon_{\text{cur}} \in [\varepsilon_{\min}, \varepsilon_{\max}]$. The total work required is bounded by $O\left(\frac{1}{\varepsilon_{\min}^2 (1-\alpha)^2 (1-\rho)^2}\right)$.


Proof. We carry out the proof in two stages. First, we show that the basic iterative update converges in work $O(\varepsilon_{\min}^{-1}(1-\alpha)^{-1}(1-\rho)^{-1})$. Then, we show that the additional work of sorting the solution vector and sweeping is bounded by $O(\varepsilon_{\min}^{-2}(1-\alpha)^{-2}(1-\rho)^{-2})$.
Push work. We count the work on just the residual $r$ and solution vector $y$. The work required to maintain the heap $Q$ and sorted array $L$ is accounted for below.

Each step, the push operation acts on a single entry in the residual that satisfies $r_j \geq \varepsilon_{\min}$. The step consists of a constant number of operations to update the residual and solution vectors (namely, updating a single entry in each). The actual amount that is removed from the residual node is $(r_j - \rho\varepsilon_{\min})$; then we add $(r_j - \rho\varepsilon_{\min})$ to the appropriate entry of the solution, and $(r_j - \rho\varepsilon_{\min})\alpha/d_i$ to $r_i$ for each neighbor $i$ of node $j$ (the nonzero entries of $P^T e_j$ are $1/d_i$). Since $j$ has $d_j$ such neighbors, the total work in one step is bounded by $O(d_j)$. If $T$ steps of the push operation are performed, then the amount of work required to obtain an accuracy of $\varepsilon_{\min}$ is bounded by $O\left(\sum_{t=0}^{T} d_{j(t)}\right)$, where $j = j(t)$ is the index of the residual operated on in step $t$.

Next we bound this expression for the work done in these "push" steps. Since all entries of the solution and residual vectors are nonnegative at all times, the sum of the values $(r_t - \rho\varepsilon_{\min})$ pushed at each step exactly equals the sum of the values in $y^{(T)}$, i.e. $\sum_{t=0}^{T} (r_t - \rho\varepsilon_{\min}) = e^T y^{(T)}$. Since $y^{(k)} = (1/(1-\alpha))D^{-1}x^{(k)}$, we then have that the sum of entries in $x^{(T)}$ equals the sum of values pushed from the residual scaled by degree and by $(1-\alpha)$, i.e. $e^T x^{(T)} = (1-\alpha)\sum_{t=0}^{T} (r_t - \rho\varepsilon_{\min}) \cdot d_{j(t)}$, where $j(t)$ is the node pushed in step $t$. We claim that the sum $e^T x^{(T)} \leq 1$. Assuming this for the moment, we get from the previous equation that $(1-\alpha)\sum_{t=0}^{T} (r_t - \rho\varepsilon_{\min}) \cdot d_{j(t)} \leq 1$. Since each step of ppr-path operates on a residual value satisfying $r_t \geq \varepsilon_{\min}$, we know that $(r_t - \rho\varepsilon_{\min}) \geq \varepsilon_{\min}(1-\rho)$, and so

$$(1-\alpha)\sum_{t=0}^{T} \varepsilon_{\min}(1-\rho) \cdot d_{j(t)} \leq (1-\alpha)\sum_{t=0}^{T} (r_t - \rho\varepsilon_{\min}) \cdot d_{j(t)} \leq 1.$$

Dividing by $\varepsilon_{\min}(1-\alpha)(1-\rho)$ completes the proof that the expression for work, $\sum_{t=0}^{T} d_{j(t)}$, is bounded by $O(\varepsilon_{\min}^{-1}(1-\alpha)^{-1}(1-\rho)^{-1})$.

Lastly, we justify the claim $e^T x^{(k)} \leq 1$. Left-multiplying the residual equation for (5.1), $(I - \alpha P^T)y^{(k)} = b - r^{(k)}$, by $(De)^T$ and using stochasticity of $v$ gives

$$\begin{aligned} e^T(I - \alpha P)Dy^{(k)} &= e^T D b - e^T D r^{(k)} \\ (1-\alpha)\,e^T D y^{(k)} &= e^T v - e^T D r^{(k)} \\ &= 1 - e^T D r^{(k)}. \end{aligned} \tag{5.3}$$

As noted above, all entries of the residual and iterative solution vector are nonnegative at all times. The sum $e^T x^{(k)} = (1-\alpha)\,e^T D y^{(k)}$ cannot exceed 1, then, because that would imply that the residual summed to a negative number, contradicting nonnegativity of the residual vector. Hence, $e^T x^{(k)} \leq 1$.

Sorting and sweeping work. Here we account for the work performed each step in maintaining the residual heap $Q(r)$, re-sorting the solution vector $L(y)$, and updating the sweep information for $L(y)$. To ease the process, we first fix some notation: denote the number of entries in the residual heap $Q(r)$ by $|Q|$, and the number of non-zero entries in the sorted solution vector $L(y)$ by $|L|$. We will bound both of these quantities later on. We continue to use $\Delta_t$ to denote the number of rank positions changed in $L(y)$ in step $t$. Finally, recall that $T$ denotes the number of iterations of the algorithm required to terminate.

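The work bounds we will prove, listed in the order in which the ppr-path algorithm performs them, are as follows:

| Operation | Actual work | Upper bound |
|---|---|---|
| Find max($r$) | $1$ | $1$ |
| Delete max($r$) | $\log \lvert Q \rvert$ | $\log\left(\frac{1}{\varepsilon_{\min}(1-\alpha)(1-\rho)}\right)$ |
| Bubblesort $L(y)$ | $\Delta_t$ | $T$ |
| Re-sweep $L(y)$ | $d_j + \Delta_t$ | $d_j + T$ |
| Update $r + r\alpha P^T e_j$ | $d_j$ | $d_j$ |
| Re-heap $Q(r)$ | $d_j \log \lvert Q \rvert$ | $d_j \log\left(\frac{1}{\varepsilon_{\min}(1-\alpha)(1-\rho)}\right)$ |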


The residual heap operations for deleting $\max Q(r)$ and re-heaping the updated entries each require $O(\log|Q|)$ work, where $|Q|$ is the size of the heap, i.e. the number of nonzero entries in the residual. We can upperbound this number using the total number of pushes performed (since a nonzero in the residual can exist only via a push operation placing it there). We bound $|Q|$ by $O(\varepsilon_{\min}^{-1}(1-\alpha)^{-1}(1-\rho)^{-1})$, then. We remark that this is quite loose, as values of $\rho$ near 1 actually force the solution and residual to be sparser, so we expect the heap size to obey a tighter bound, though we do not yet have a proof of this.

Re-sorting the solution vector via a bubblesort can involve no more operations than the length of the solution vector. Since a nonzero in entry $y_j$ can exist only if a step of the algorithm operates on an entry $r_j$, the number of nonzeros in $y$ is bounded by the number of steps of the algorithm, i.e. $|L| \leq T$. We believe this bound to be loose, but cannot currently tighten it. Note that the work required in updating sweep information also requires $\Delta_t$ work, which we again upperbound by $T$. The $d_j$ term in updating sweep information is from accessing the neighbors of the entry $y_j$, the node changing its rank.

The dominant terms in the above expression for work are the re-heap updates and the bubblesort and re-sweep operations, which require a total of $O(d_j \log|Q| + |L|)$ work each step. Summing this over all $T$ steps of the algorithm, we can majorize work by $O\left(\log|Q| \cdot \sum_{t=0}^{T} d_{j(t)}\right) + O\left(\sum_{t=0}^{T} |L|\right)$, which is upperbounded by $O\left(\log|Q| \cdot \sum_{t=0}^{T} d_{j(t)} + T \cdot |L|\right)$.

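Finally, substituting in our loose upperbounds for $T$, $|Q|$, and $|L|$ mentioned above completes the proof:

$$O\left( \frac{1}{\varepsilon_{\min}(1-\alpha)(1-\rho)} \log\left( \frac{1}{\varepsilon_{\min}(1-\alpha)(1-\rho)} \right) + \frac{1}{\varepsilon_{\min}^2(1-\alpha)^2(1-\rho)^2} \right) \leq O\left( \frac{1}{\varepsilon_{\min}^2(1-\alpha)^2(1-\rho)^2} \right). \qquad \square$$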


5.3 Fast multi-parameter PPR 

Here we present a fast framework for computing $\varepsilon$-approximations of a push-based PPR diffusion without computing a new diffusion for each $\varepsilon$. This enables us to identify the optimal output that would result from multiple diffusion computations for different $\varepsilon$ values, but without having to do the work of computing a new diffusion for each different $\varepsilon$. This algorithmic framework does not admit the parameter $\rho$ as easily, because of implementation details surrounding the data structures used to handle sorting and updating the residual.

The framework is compatible with every set of parameter choices for $\varepsilon$ that allows for constant-time bin look-ups. More precisely, the set of parameters $\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_N$ must have an efficient method for determining the index $k$ such that, given a value $r$, we have $\varepsilon_{k-1} > r \geq \varepsilon_k$. We focus on a set of $\varepsilon$ values that are taken from a log-spaced grid: that is, the parameters are of the form $\varepsilon_k = \varepsilon_0\theta^k$ for constants $0 < \varepsilon_0, \theta < 1$. Because we assume our $\varepsilon$ parameters are taken from such a grid, we call our method ppr-grid. Another possibly useful case is choosing values taken from a grid formed from Chebyshev-like nodes, allowing for constant-time shelf-placement via $\cos^{-1}$ evaluations.

We emphasize that the underlying algorithm we use to compute the PageRank diffusion 
is closely related to the push method discussed in Section as implemented by [^ ; in the 
case that only a single accuracy parameter is used, the algorithms are identical. When 
more than one accuracy setting is used, we employ a special data structure, which we call 
a shelf. 


The shelf structure 

The main difference between our algorithm ppr-grid and previous implementations of the push method lies in our data structure replacing the priority queue, $Q$, discussed in ppr-path. Instead of inserting residual entries in a heap as in ppr-path, we organize them in a system of arrays. Each array holds entries between consecutive values of $\varepsilon_k$, so that each array holds entries larger than the shelf below it. For this reason, we call this system of arrays a "max-shelf", $H$, and refer to each individual array as a "shelf", $H_k$. The process is effectively a bucket sort: each shelf (or bucket) of $H$ holds entries of the residual lying between consecutive values of $\varepsilon_k$ in the parameter grid. For parameters $\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_N$, shelf $H_k$ holds residual values $r$ satisfying $\varepsilon_{k-1} > r \geq \varepsilon_k$. Residual entries smaller than $\varepsilon_N$ are omitted from $H$ (since convergence does not require operating on them). Residual entries with values greater than $\varepsilon_0$ are simply placed in shelf $H_0$.


PPR on a grid of e parameters 

During the iterative step of ppr-grid, then, rather than place a residual entry at the back of $Q$, we instead place the entry at the back of the appropriate shelf, $H_k$. Once all shelves $H_m(r)$ are cleared for $m \leq k$, then the residual has no entries larger than $\varepsilon_k$, and so we have arrived at an approximation vector satisfying convergence criterion (2.2) with accuracy $\varepsilon_k$. At this point, we perform a sweep procedure using the $\varepsilon_k$-solution. We then repeat the process until the next shelf is cleared, and a new $\varepsilon_{k+1}$-solution is produced.

PPR grid algorithm. The iterative step is as follows:

1. Determine the top-most non-empty shelf, $H_k$.

2. While $H$ contains an entry in shelf $k$ or above, do the following:

3. Pop an entry on or above shelf $H_k$, say value $r$ in entry $r_j$, and set $r_j = 0$.

4. Add $r$ to $x_j$.

5. Add $r\alpha P^T e_j$ to the residual vector.

6. For each entry of the residual that was updated, move that node to the correct shelf, $H_m$, where $\varepsilon_{m-1} > r \geq \varepsilon_m$. If an entry is placed on a shelf higher than $k$, record the new top-shelf.

7. Shelves 0 through $k$ are cleared, so the $\varepsilon_k$-solution is done; perform a sweep.

Once all shelves are empty, the approximation with strictest accuracy, $\varepsilon_N$, has been attained, and a final sweep procedure is performed.
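The iteration above can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: it assumes an undirected adjacency dictionary and a single seed node, works in the modified variables of Section 5.1 (so the solution is called `y`), and the names `ppr_grid`, `shelf_of`, `place`, and `where` are hypothetical. Instead of physically moving an entry between shelves, it appends the node to its new shelf and skips stale copies when they are popped. Where the full algorithm would perform a sweep, the sketch stores a snapshot of the solution together with the current residual maximum.

```python
import math


def ppr_grid(adj, seed, alpha=0.85, eps0=0.1, theta=0.5, N=8):
    """Shelf-based push over the grid eps_k = eps0 * theta**k, k = 0..N."""
    deg = {u: len(nbrs) for u, nbrs in adj.items()}
    eps = [eps0 * theta ** k for k in range(N + 1)]

    def shelf_of(val):
        # Shelf k holds eps[k-1] > val >= eps[k]; None means "below the grid".
        if val < eps[N]:
            return None
        if val >= eps[0]:
            return 0
        k = min(N, max(1, math.ceil(math.log(val / eps0) / math.log(theta))))
        while val >= eps[k - 1]:    # guard against rounding at shelf boundaries
            k -= 1
        while val < eps[k]:
            k += 1
        return k

    y, r = {}, {seed: 1.0 / deg[seed]}
    shelves = [[] for _ in range(N + 1)]
    where = {}                      # node -> shelf holding its valid entry

    def place(node):
        m = shelf_of(r[node])
        if m is not None and where.get(node) != m:
            shelves[m].append(node)
            where[node] = m
        return m

    place(seed)
    snapshots, top = [], 0
    for k in range(N + 1):
        while top <= k:                         # clear shelves 0..k
            if not shelves[top]:
                top += 1
                continue
            j = shelves[top].pop()
            if where.get(j) != top:
                continue                        # stale copy left on an old shelf
            where[j] = None
            rj, r[j] = r[j], 0.0
            y[j] = y.get(j, 0.0) + rj
            for i in adj[j]:                    # residual update with P^T e_j
                r[i] = r.get(i, 0.0) + alpha * rj / deg[i]
                m = place(i)
                if m is not None and m < top:
                    top = m                     # record the new top shelf
        # A sweep over the eps_k-solution would be performed here.
        snapshots.append((eps[k], dict(y), max(r.values())))
    return y, r, snapshots
```

Because `top` persists across grid values, the scan for the next non-empty shelf costs $O(N)$ in total plus constant work per residual update, which mirrors the accounting described for the top-shelf computation.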

Shelf computation. In each iteration of ppr-grid we must place multiple entries into their respective "shelves". Here we show that computing the correct shelf where a value $r$ will be placed can be accomplished in constant time.

Let $\varepsilon_k = \varepsilon_0\theta^k$ for a fixed value of $\theta \in (0,1)$. We want a value $r$ satisfying $\varepsilon_{k-1} > r \geq \varepsilon_k$ to be placed on shelf $k$. If $r \geq \varepsilon_0$, then we place $r$ into shelf 0. Otherwise, making the substitution $\varepsilon_k = \varepsilon_0\theta^k$ and performing some algebra yields

$$k - 1 < \frac{\log(r/\varepsilon_0)}{\log(\theta)} \leq k,$$

so $k$ can be computed by taking the ceiling of $\log(r/\varepsilon_0)/\log(\theta)$, which is a constant time operation. Note that this process requires that $0 < \varepsilon_k < 1$ holds for all $k$, that $\theta \in (0,1)$, and that $r > 0$.
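As a small illustration of the shelf formula, the following Python function computes the shelf index for a log-spaced grid. The name `shelf_index` and the argument checks are assumptions made for this sketch. Values that sit exactly on a grid point can land one shelf off because of floating-point rounding in the logarithms, so a production version would likely compare against the stored grid values as a final check.

```python
import math


def shelf_index(r, eps0, theta):
    """Return k with eps0*theta**(k-1) > r >= eps0*theta**k (0 if r >= eps0)."""
    if not (r > 0 and eps0 > 0 and 0 < theta < 1):
        raise ValueError("requires r > 0, eps0 > 0, and 0 < theta < 1")
    if r >= eps0:
        return 0
    # Dividing by log(theta) < 0 flips the inequalities, giving k-1 < ratio <= k.
    return math.ceil(math.log(r / eps0) / math.log(theta))
```

For example, with `eps0 = 0.1` and `theta = 0.5` the shelf boundaries are 0.1, 0.05, 0.025, and so on, so a residual value of 0.07 belongs to shelf 1 and a value of 0.03 belongs to shelf 2.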

Top shelf. Each step of ppr-grid also requires determining the top non-empty shelf. 
This can be done in constant time by tracking what the top shelf is during each residual 
update. If k is the top shelf immediately prior to step (2.4), then k will still be the top 
shelf after the residual update is complete, unless one of the updates in step (6.) moves 
an entry to a shelf I < k. By checking for this event during the update of each individual 
residual entry in step (6.), we will have knowledge of the top non-empty shelf at the 
beginning of each step, with only constant work per step. 

Once the current working shelf is emptied, then it is possible that the next non-empty 
shelf is many shelves down, i.e. shelves $H_k$ and higher are emptied and the next non-empty 
shelf is $H_{k+c}$ for some large number $c$. Then determining $k + c$ takes $O(c)$ operations. 
However, this operation is performed every time the algorithm switches from one value of 
$\varepsilon_k$ to the next. If there are $N$ values of $\varepsilon_k$, then the total work in all calls of this top-shelf 
computation is bounded by $O(N)$. 


Theorem 5.2 Given a random walk transition matrix $P = AD^{-1}$, stochastic vector $v$, and input parameters $\alpha, \theta \in (0,1)$ and $\varepsilon_k = \varepsilon_0\theta^k$, our ppr-grid algorithm outputs the best-conductance set found from sweeps over $\varepsilon_k$-accurate degree-normalized solution vectors $x$ to $(I - \alpha P)x = (1-\alpha)v$, for all values $\varepsilon_k$ for $k = 0$ through $N$. The work in computing the diffusions is bounded by $O(\varepsilon_N^{-1}(1-\alpha)^{-1})$. This improves on the method of computing the $N$ diffusions separately, which is bounded by $O(\varepsilon_N^{-1}(1-\alpha)^{-1}(1-\theta)^{-1}(1-\theta^{N+1}))$. The two methods perform the same amount of sweep-cut work.


Table 1. Datasets

| Graph | $\lvert V \rvert$ | $\lvert E \rvert$ | $d_{\text{ave}}$ |
|---|---|---|---|
| itdk0304 | 190,914 | 607,610 | 6.37 |
| dblp | 226,413 | 716,460 | 6.33 |
| youtube | 1,134,890 | 2,987,624 | 5.27 |
| fb-one | 1,138,557 | 4,404,989 | 3.9 |
| fbA | 3,097,165 | 23,667,394 | 15.3 |
| ljournal | 5,363,260 | 49,514,271 | 18.5 |
| hollywood | 1,139,905 | 56,375,711 | 98.9 |
| twitter | 41,652,230 | 2,041,892,992 | 98 |
| friendster | 65,608,366 | 1,806,067,135 | 55.1 |


Proof. Note that the amount of push-work required to produce a diffusion with smallest accuracy $\varepsilon_N$ is exactly the same as the push-work performed in computing an $\varepsilon_N$ solution via ppr-path; the only difference is in how we organize the residual and solution vectors. Hence, the push-work for ppr-grid is bounded by $O(\varepsilon_N^{-1}(1-\alpha)^{-1})$. Updating the shelf structure for ppr-grid requires only a constant number of operations in each iteration, and so the dominating operation in one step of ppr-grid is the residual push work. Thus, the push-work bound for ppr-grid is $O(\varepsilon_N^{-1}(1-\alpha)^{-1})$.

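Push-work for N separate diffusions. As noted above, computing a diffusion with parameters $\varepsilon_k$ and $\alpha$ requires push-work $O(\varepsilon_k^{-1}(1-\alpha)^{-1})$. Summing this over all values of $\varepsilon_k$ gives $\sum_{k=0}^{N} \varepsilon_k^{-1}(1-\alpha)^{-1} = (1-\alpha)^{-1}\sum_{k=0}^{N} \varepsilon_k^{-1}$. Substituting $\varepsilon_0\theta^k$ in place of $\varepsilon_k$, we see this sum is simply a scaled partial geometric series, $\sum_{k=0}^{N} \varepsilon_k^{-1} = \varepsilon_0^{-1}\theta^{-N}(1-\theta^{N+1})/(1-\theta)$. Simplifying gives

$$\sum_{k=0}^{N} \frac{1}{\varepsilon_k(1-\alpha)} = \frac{1-\theta^{N+1}}{\varepsilon_N(1-\alpha)(1-\theta)},$$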

proving the bound on the push-work. For our choices £o = 10“^, £jv = 10“®/3, and 
9 = 0.66 (which corresponds to using N = 32 diffusions), this quantity is roughly 2.9 times 
greater than computing only one diffusion, as our method does. 
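The factor of roughly 2.9 can be checked directly from the ratio of the two push-work bounds, which depends only on $\theta$ and $N$. The arithmetic below is a sketch using the stated values $\theta = 0.66$ and $N = 32$:

```latex
\[
\frac{\displaystyle \frac{1-\theta^{N+1}}{\varepsilon_N(1-\alpha)(1-\theta)}}
     {\displaystyle \frac{1}{\varepsilon_N(1-\alpha)}}
 \;=\; \frac{1-\theta^{N+1}}{1-\theta}
 \;=\; \frac{1 - 0.66^{33}}{0.34}
 \;\approx\; \frac{1 - 1.1\times 10^{-6}}{0.34}
 \;\approx\; 2.94 .
\]
```

Since $\theta^{N+1}$ is negligible here, the ratio is essentially $1/(1-\theta)$, so the speedup is governed by the grid spacing rather than by the number of grid points.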

Sweep work. The number of operations required in computing the diffusion is bounded by $O(\varepsilon_N^{-1}(1-\alpha)^{-1})$, but this does not include the work done in sweeping over the various $\varepsilon_k$-approximation vectors. The sweep operation requires sorting the solution vector. As noted in the proof of work for ppr-path, the number of nonzeros in the solution vector is bounded by $O(\varepsilon_N^{-1}(1-\alpha)^{-1})$, and so the sorting work is $O(\varepsilon_N^{-1}(1-\alpha)^{-1}\log(\varepsilon_N^{-1}(1-\alpha)^{-1}))$. This implies that sorting is the dominant subroutine of the algorithm. In practice the bound on the number of nonzeros in the solution is loose, and the push operations comprise most of the labor.


6 Experimental Results on Finding Small Conductance Sets 


We have presented two frameworks for computing a single personalized PageRank diffusion 
across multiple parameter settings. Here we analyze their performance on a set of real- 
world social and information networks with varying sizes and edge-densities with the goal 
of identifying sets of small conductance. All datasets were altered to be symmetric and 
have 0s on their diagonals; this is done by deleting any self-edges and making all directed 
edges undirected. In addition to versions of the Facebook dataset analyzed in Section 
we test our algorithms on graphs including twitter-2010 from [T^, friendster and youtube 
from [M 31 , dblp-2010 and hollywood-2009 in IslE], idk0304 from 27 , and ljournal-2008 


[^. See Table 1 for a summary of their properties. 
6.1 The effect of $\rho$ on conductance

Our first experimental study regards the selection of the parameter $\rho$ for finding sets of small conductance. We already established that $\rho = 0.9$ yielded qualitatively accurate solution path plots. However, for the specific problem of identifying small conductance sets, we find a curious behavior and get the best results with small values of $\rho$. We'll explain why this is shortly, but consider the results in Figure 7. In the left subplot, we see the maximum difference between the minimum conductance found for any value of $\rho$ over a series of trials. It can be large, for instance, 0.7 for one trial on the LiveJournal graph, where large $\rho$ shows worse results. In that same figure, we show the runtime scaling. It seems to scale with $1/(1-\rho)$, which is slightly better than expected from the bound in Theorem 5.1.


Figure 7. Here we display the behavior of the solution paths as $\rho$ scales from 0 to 1. At left, we display the gap between $\phi(\rho)$, the best conductance found at that value of $\rho$, and $\phi_{\min}$, the minimum conductance found over all values of $\rho$. The lines depict the maximum difference over 100 trials of the quantity $\phi(\rho) - \phi_{\min}$. This plot shows that the best conductance found becomes worse as $\rho$ approaches 1. At right, the runtime appears to scale with $1/(1-\rho)$, which is better than the $1/(1-\rho)^2$ predicted by our theory.


The greatest difference between the best conductance found for any value of p and 
the worst conductance found for any p occurs in the livejournal graph, with a gap of 
nearly 0.7. We discovered that the cause for this disparity is that large values of p delay 
the propagation of the diffusion, and so the p = 0.9 paths at e = 10“^ did not spread 
far enough to find a set of conductance near 0.07. In contrast, all paths with p < 0.5 
did diffuse deep enough into the graph to identify this good conductance set. Thus, it is 
possible that many of the differences in conductance performance between paths with 
different values of p might in fact be caused by the size of the region to which the diffusion 
spreads for a given value of e. Figure 8 illustrates this finding. 

Our conclusion from these experiments is that, for the goal of finding sets of small conductance, we should use small values of $\rho$ near zero. While it sometimes happens that $\rho > 0$ slightly improves conductance, this is not a reliable observation, and so for the remaining experiments on conductance, we set $\rho = 0$. (This has the helpful side effect of making it easier to compare with our ppr-grid.)

Figure 8. At left, the $\rho = 0$ paths identify mostly poor conductance sets ($\phi \approx 0.8$), and locate a set of low conductance, $\phi = 0.0788$, only toward the end of the diffusion. At right we see that the $\rho = 0.9$ paths cannot find this set with e = 10“®. With a slightly smaller accuracy (e = 5 • 10“® instead of e = 10“®), the diffusion is able to spread far enough to locate the good conductance set.


6.2 Runtime and conductance: ppr-path 

Our first method, ppr-path, is aimed at studying how PPR diffusions vary with the parameter $\varepsilon$. 
Toward this, Table 2 emphasizes the sheer volume of distinct $\varepsilon$-approximations 
that ppr-path explores. We also want to highlight both the efficiency of our method over 
the naive approach for computing the solution paths, and the additional information that 
the solution paths provide compared to a single diffusion. 

With this in mind, our experiment proceeds as follows. On each data set, we selected 
100 distinct nodes uniformly at random, and ran three personalized PageRank algorithms 
from that node, with the settings a = 0.99 and e = 10“®. Table 2 displays results for 
our solution paths algorithm (“path” in the table) compared with two other algorithms 
chosen to emphasize the runtime and the performance of ppr-path. 

To show how ppr-path scales compared to the runtime of a single diffusion, and to 
emphasize that the solution paths can locate better conductance sets in some cases, we 
compare our solution paths method with a standard implementation for computing a 
single PPR diffusion (“single” in Table 2). Column 3 in the table gives the median runtime, 
taken over 100 trials, of the single diffusion. To compare, column 4 gives the median ratio 
of “path” time to “single” time. Although ppr-path is slower on the small graphs, on 
the larger graphs we see the runtime is nearly the same as for a single PPR diffusion. At 
the same time, column 2 shows that “path” computes the results from hundreds or even 
thousands of diffusions, a significant gain in information over the single PPR diffusion. Finally, column 7 gives the best ratio of conductance found by “path” compared to that found by “single”. This shows that the solution paths can improve conductance by 10% to even 50% compared to a single diffusion.

| Data | num $\varepsilon$ | single 25th | single 50th | single 75th | ppr-path 25th | ppr-path 50th | ppr-path 75th | multi 25th | multi 50th | multi 75th | $\phi$-ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| itdk0304 | 5292 | 0.02 | 0.02 | 0.03 | 0.28 | 0.41 | 0.69 | 70.8 | 94.2 | 123.2 | 1.77 |
| dblp | 8138 | 0.02 | 0.02 | 0.02 | 0.40 | 0.51 | 0.65 | 87.3 | 97.9 | 111.5 | 1.12 |
| youtube | 2844 | 0.01 | 0.01 | 0.01 | 0.05 | 0.10 | 0.15 | 28.6 | 38.7 | 49.2 | 1.47 |
| fb-one | 3464 | 0.01 | 0.01 | 0.01 | 0.03 | 0.05 | 0.07 | 28.1 | 34.6 | 40.5 | 1.09 |
| fbA | 862 | < 0.01 | < 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 14.0 | 16.5 | 19.5 | 1.16 |
| ljournal | 2799 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.05 | 24.5 | 30.9 | 43.6 | 2.09 |
| hollywood | 423 | < 0.01 | < 0.01 | < 0.01 | < 0.01 | < 0.01 | 0.01 | 14.0 | 17.2 | 22.4 | 1.19 |
| twitter | 172 | < 0.01 | < 0.01 | < 0.01 | < 0.01 | < 0.01 | 0.01 | 6.5 | 10.3 | 18.1 | 1.05 |
| friendster | 402 | < 0.01 | < 0.01 | < 0.01 | < 0.01 | < 0.01 | 0.01 | 11.1 | 13.6 | 16.6 | 1.09 |

Table 2. Runtime and conductance comparison of the solution paths (all accuracies from 10“^ to 10“®) with (1) a single PPR diffusion with accuracy 10“® (labelled “single”) and (2) 10,000 PPR diffusions, accuracies k~^ for k = 1 to 10,000 (labelled “mult”). On each dataset we selected 100 distinct nodes uniformly at random and ran the algorithms with the settings a = 0.99 and e = 10“® and $\rho = 0$. Column “num $\varepsilon$” displays the median number of distinct accuracy parameters $\varepsilon$ explored by our algorithm ppr-path. The time columns report the 25th, 50th, and 75th percentiles of runtimes (in seconds) over these 100 trials for the single diffusion, ppr-path, and the multiple diffusions. The column “$\phi$-ratio” lists the largest (best) ratio of conductance achieved by a single diffusion with conductance achieved by our ppr-path, showing our method can improve on the conductance found by a single diffusion by as much as a factor of 2.09.

To display the efficiency of our algorithm in computing these many diffusion settings, 
we again use the standard PPR implementation, but this time set to compute the diffusion 
for every accuracy setting $k^{-1}$ for $k = 1$ to 10,000. This algorithm is “mult” in Table 2 
and is essentially a naive method for approximating the solution paths. Column 5 gives the 
ratio of “mult” time to “single” time, and shows that this naive approach to computing 
diffusions with multiple accuracies is prohibitively slow: it is thousands of times slower 
than our “path” method. 

Lastly, we acknowledge here that both variations on the PPR diffusion are naive 
approaches to the problem at hand. However, currently there is no other algorithm for 
computing the PPR solution paths which we can use as a more competitive baseline. 


6.3 Runtime and Conductance: ppr-grid 

We compare our second method ppr-grid with a method called ppr-grow, which uses the 
push framework described earlier. Each of these algorithms uses a variety of accuracy 
settings, and returns the set of best conductance found from performing a sweep-cut over 
the diffusion vector resulting from each accuracy setting. The algorithm ppr-grow has 
32 pre-set accuracy parameters $\varepsilon_k$. In contrast with ppr-grid, which takes its accuracy 
parameters from a log-spaced grid $\varepsilon_k = \varepsilon_0 \theta^k$, the parameters for ppr-grow are chosen 
as the inverses of values from the grid $10^j \cdot [2, 3, 4, 5, 10, 15]$ for $j = 0, 1, \ldots, 4$, 
along with two additional parameters, $10^{-5}/2$ and $10^{-5}/3$. 
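
The multi-accuracy procedure just described (run a push-based diffusion at each accuracy, sweep over the result, keep the best set) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ppr-grow code itself: the names `ppr_push`, `sweep_cut`, and `best_over_accuracies`, the dictionary-of-neighbor-lists graph format, and the toy graph `barbell` are all hypothetical, and the graph is assumed to be undirected with no self-loops.

```python
from collections import deque


def ppr_push(adj, seed, alpha, eps):
    """Approximate seeded PageRank with push steps (hypothetical helper).

    Keeps a solution x and residual r, and pushes from any node v with
    r[v] >= eps * deg(v) until no such node remains.
    """
    x, r = {}, {seed: 1.0}
    queue = deque([seed])
    while queue:
        v = queue.popleft()
        rv, deg_v = r.get(v, 0.0), len(adj[v])
        if rv < eps * deg_v:
            continue
        x[v] = x.get(v, 0.0) + (1.0 - alpha) * rv
        r[v] = 0.0
        share = alpha * rv / deg_v
        for u in adj[v]:
            old = r.get(u, 0.0)
            r[u] = old + share
            # Enqueue u only at the moment its residual crosses the threshold.
            if old < eps * len(adj[u]) <= r[u]:
                queue.append(u)
    return x, r


def sweep_cut(adj, x):
    """Best-conductance prefix of the nodes ordered by x[v] / deg(v)."""
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    order = sorted(x, key=lambda v: x[v] / len(adj[v]), reverse=True)
    in_set, vol, cut = set(), 0, 0
    best_phi, best_size = float("inf"), 0
    for size, v in enumerate(order, start=1):
        inside = sum(1 for u in adj[v] if u in in_set)
        in_set.add(v)
        vol += len(adj[v])
        # Edges into the set stop being cut; the remaining edges of v become cut.
        cut += len(adj[v]) - 2 * inside
        denom = min(vol, total_vol - vol)
        if denom > 0 and cut / denom < best_phi:
            best_phi, best_size = cut / denom, size
    return best_phi, set(order[:best_size])


def best_over_accuracies(adj, seed, alpha, eps_values):
    """Run one diffusion per accuracy from scratch and keep the best sweep set."""
    best = (float("inf"), set(), None)
    for eps in eps_values:
        x, _ = ppr_push(adj, seed, alpha, eps)
        if not x:
            continue
        phi, nodes = sweep_cut(adj, x)
        if phi < best[0]:
            best = (phi, nodes, eps)
    return best


# Toy graph (an assumption for illustration): two triangles joined by one edge.
barbell = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
phi, nodes, eps = best_over_accuracies(barbell, 0, 0.99, [1e-1, 1e-2, 1e-3, 1e-4])
print(phi, sorted(nodes), eps)
```

Every call to `ppr_push` here starts from an empty solution, so push work done at a lax accuracy is repeated at each stricter one; that redundancy is what the comparison in this section measures.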

In addition to $\alpha$, our method ppr-grid has the parameters $\varepsilon_0$ and $\varepsilon_N$, the laxest and 
strictest accuracies (respectively), and $\theta$, which determines the fineness of the grid of 
accuracy parameters. We use the values $\varepsilon_0 = 10^{-1}$ and $\varepsilon_N = 10^{-5}/3$, and use values of $\theta$ 
corresponding to $N = 32$, 64, and 1256 different accuracy parameters. 
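
To make the two accuracy schedules concrete, the following Python sketch builds both lists. It is an illustration under assumptions rather than the authors' code: the function names are hypothetical, the endpoints are taken to be $10^{-1}$ and $10^{-5}/3$, and the log-spaced grid is assumed to be indexed as $\varepsilon_k = \varepsilon_0 \theta^k$ for $k = 1, \ldots, N$ so that the last entry equals the strictest accuracy.

```python
def ppr_grow_accuracies():
    """The 32 preset accuracies: inverses of 10^j * [2, 3, 4, 5, 10, 15]
    for j = 0..4, plus the two extra values 1e-5/2 and 1e-5/3."""
    base = [2, 3, 4, 5, 10, 15]
    eps = [1.0 / (10 ** j * b) for j in range(5) for b in base]
    eps += [1e-5 / 2, 1e-5 / 3]
    return sorted(eps, reverse=True)  # laxest accuracy first


def log_spaced_grid(eps0, epsN, N):
    """Return (theta, grid) with grid[k-1] = eps0 * theta**k for k = 1..N.

    theta is chosen so that the N-th grid point equals epsN (assumed indexing).
    """
    theta = (epsN / eps0) ** (1.0 / N)
    return theta, [eps0 * theta ** k for k in range(1, N + 1)]


grow = ppr_grow_accuracies()
print(len(grow), grow[0], grow[-1])

for N in (32, 64, 1256):
    theta, grid = log_spaced_grid(1e-1, 1e-5 / 3, N)
    print(N, theta, grid[-1])
```

A larger $N$ gives a $\theta$ closer to 1, so consecutive accuracies differ by a smaller factor and the grid is finer.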

We emphasize that this comparison with the ppr-grow method is not as naive as it 
might seem: out of the 32 calls that it makes, in practice the very last call (with the 
strictest value of $\varepsilon$) accounts for nearly 37% of the total runtime. This means that making 
only a single call would save little work, and would sacrifice the information from the 
other 31 (smaller) approximations. Furthermore, the primary optimizations that would 
be made to the ppr-grow framework to improve on this are exactly the optimizations 
that we make with our ppr-grid algorithm, namely avoiding re-doing push work between 
diffusion computations for different values of $\varepsilon$. 
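
The idea of not re-doing push work can be illustrated with a warm start: keep the solution and residual from a lax accuracy and simply continue pushing when the accuracy is tightened. The Python sketch below shows only that idea. It is not the ppr-grid algorithm, which manages its accuracies differently; the names `push_until` and `warm_started_path`, the graph format, and the toy graph are hypothetical, and the graph is assumed undirected with no self-loops.

```python
from collections import deque


def push_until(adj, x, r, alpha, eps):
    """Continue push steps on the state (x, r), in place, until every node
    satisfies r[v] < eps * deg(v). Returns the number of pushes performed."""
    queue = deque(v for v in r if r[v] >= eps * len(adj[v]))
    pushes = 0
    while queue:
        v = queue.popleft()
        rv, deg_v = r[v], len(adj[v])
        x[v] = x.get(v, 0.0) + (1.0 - alpha) * rv
        r[v] = 0.0
        pushes += 1
        share = alpha * rv / deg_v
        for u in adj[v]:
            old = r.get(u, 0.0)
            r[u] = old + share
            # A node joins the queue only when it crosses the threshold.
            if old < eps * len(adj[u]) <= r[u]:
                queue.append(u)
    return pushes


def warm_started_path(adj, seed, alpha, eps_values):
    """Visit accuracies from laxest to strictest, reusing (x, r) throughout."""
    x, r = {}, {seed: 1.0}
    counts = []
    for eps in sorted(eps_values, reverse=True):
        counts.append(push_until(adj, x, r, alpha, eps))
        # At this point x is a valid approximation for accuracy eps and
        # could be handed to a sweep-cut routine before tightening further.
    return x, r, counts


# Toy graph (an assumption for illustration): two triangles joined by one edge.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
x, r, counts = warm_started_path(toy, 0, 0.99, [1e-1, 1e-2, 1e-3])
print(counts)
```

Because a push only moves mass from the residual into the solution and onto neighbors, the state reached at one accuracy is a legitimate starting point for any stricter accuracy, so each entry of `counts` is only the additional work for that level.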


Because the two algorithms compute the same PageRank diffusion, comparing their 
runtimes here allows us to study what proportion of the total work is made up of redundant 
push operations, and what proportion is comprised of the sweep cut procedures, which 
both algorithms perform anew for each diffusion. To study this, we highlight the results 
in Table 3, which displays the runtimes for ppr-grow and the ratios of the runtimes of 
ppr-grid with ppr-grow for computing the best-conductance set from the same number 
of different diffusions, $N = 32$. We also display ppr-grid results for the cases $N = 64$ 
and 1256 to show how the algorithm scales with the fineness of the grid. 


To compare runtimes, we perform the following for each different dataset. For 100 
distinct nodes selected uniformly at random, we ran both algorithms with the setting 
$\alpha = 0.99$. We display the best (25%) and worst (75%) quartile of performance of each 
algorithm and parameter setting. On almost all datasets, we see that ppr-grid with 
$N = 32$ has a speedup of a factor 2 to 3. This is consistent with our theoretical comparison 
of the two runtimes in Theorem 5.2, which predicts a factor of 2.9 difference in the push 
work that the two algorithms perform. Then, columns 6 through 9 of Table 3 display how 
quickly ppr-grid can compute even more diffusions: whereas ppr-grow takes around 1 
second to compute and analyze $N = 32$ diffusions, ppr-grid takes little more than half 
that time to compute $N = 64$ diffusions (columns 6 and 7). Columns 8 and 9 show 
that ppr-grid can compute and analyze $N = 1256$ diffusions, nearly 40 times as many 
as ppr-grow, in an amount of time only 1.10 to 6.59 times greater than the time required 
by ppr-grow. 


The conductances displayed in Table 4 are taken from the same trials as the runtime 
information in Table 3. As with the table of runtimes, for each dataset the table gives 
the 25% (best) and 75% (worst) percentiles of conductance scores produced by each 
algorithm on the 100 trials. We see nearly identical conductance scores for ppr-grow and 
ppr-grid with $N = 32$, which we expect because the two perform nearly identical work. 
It is interesting to note, however, that increasing the number of diffusions can result in 
significantly improved conductance scores in some cases, as with $N = 1256$ on the “fb-one” 
and “hollywood” datasets. This demonstrates concretely the potential effect of using a 
broad swath of parameter settings for $\varepsilon$ to study the meso-scale structure. Moreover, it 
demonstrates that even a finely spaced mesh of $\varepsilon$ values, as with ppr-grow and ppr-grid 
with $N = 64$, can miss informative diffusions. 





D. F. Gleich and K. Kloster 


            ppr-grow            ppr-grid, N = 32    ppr-grid, N = 64    ppr-grid, N = 1256
            time (sec.)         time ratio          time ratio          time ratio
Data        25        75        25        75        25        75        25        75
itdk0304    6.23      8.73      0.56      0.61      0.61      0.66      1.10      1.20
dblp        4.52      7.21      0.56      0.62      0.62      0.67      1.28      1.43
youtube     1.73      2.39      0.39      0.50      0.54      0.65      3.35      4.38
fb-one      1.25      1.60      0.33      0.39      0.45      0.53      3.72      4.38
fbA         0.49      0.65      0.47      0.55      0.63      0.72      5.99      6.59
ljournal    0.82      1.20      0.44      0.55      0.58      0.74      4.57      6.12
hollywood   0.28      0.64      0.34      0.49      0.44      0.60      3.47      5.00
twitter     0.13      0.37      0.39      0.44      0.54      0.60      4.61      5.44
friendster  0.34      0.49      0.39      0.44      0.51      0.58      3.90      4.32


Table 3. Runtime comparison of our ppr-grid with ppr-grow. For each dataset, we 
selected 100 distinct nodes uniformly at random and ran ppr-grow with 32 and ppr-grid 
with $N$ different accuracy settings $\varepsilon_k$. Columns 2 and 3 display the 25th and 75th 
percentile runtimes for ppr-grow (in seconds). The other columns display the 25th and 75th 
percentiles over the 100 trials of the ratios of the runtimes of ppr-grid (using the indicated 
parameter setting) with the runtime of ppr-grow on the same node. These results demonstrate 
that our algorithm computing over $N = 32$ accuracy parameters $\varepsilon_k$ achieves the factor of 2 to 
3 speed-up predicted by our theory in Section 5.3. 


            ppr-grow    ppr-grid, N = 32    ppr-grid, N = 64    ppr-grid, N = 1256
Data                    25        75        25        75        25        75
itdk0304    0.06        1.00      1.00      1.00      1.01      1.00      1.02
dblp        0.07        1.00      1.00      1.00      1.00      1.00      1.01
youtube     0.18        1.01      1.30      1.09      1.50      1.21      1.72
fb-one      0.37        1.06      1.16      1.10      1.26      1.18      1.37
fbA         0.56        1.00      1.05      1.00      1.06      1.00      1.09
ljournal    0.32        1.00      1.01      1.00      1.01      1.00      1.01
hollywood   0.29        1.00      1.01      1.00      1.01      1.00      1.02
twitter     0.80        1.00      1.00      1.00      1.00      1.00      1.00
friendster  0.85        1.00      1.00      1.00      1.00      1.00      1.01


Table 4. Conductance comparison of our ppr-grid with ppr-grow. Column 2 displays 
the median of the conductances found by ppr-grow in the same 100 trials presented 
in Table 3. The other columns display the 25% and 75% percentiles of the ratio of the 
conductances achieved by ppr-grow and ppr-grid for the same seed set. For example, 
on the dataset ‘fb-one’, the conductances found by ppr-grow are 18% larger than those 
found by ppr-grid with $N = 1256$ accuracy settings, and that comparison is on the 
quartile of trials where ppr-grid compares the worst to ppr-grow. We report the ratios 
in this manner (rather than their reciprocals) because in this form the values displayed 
are greater than 1, which distinguishes the values from conductance scores (which are 
between 0 and 1). 


7 Related work 

As we already mentioned, regularization paths are common in statistics [^[^, and they 
help guide model selection questions. In terms of clustering and community detection, 
solution paths are extremely important for a new type of convex clustering objective 
function [16, 22]. Here, the solution path is closely related to the number and size of 
clusters in the model. 

One of the features of the solution path that we utilize to understand the behavior 
of the diffusion is the stability of the set of best conductance over time. In ref. [^, the 
authors use a closely related concept to study the persistence of communities as a different 
type of temporal relaxation parameter varies. Again, they use the stability of communities 
over regions of this parameter space to indicate high-quality clustering solutions. 

In terms of PageRank, there is a variety of work that considers the PageRank vector as 
a function of the teleportation parameter $\alpha$. Much of this work seeks to understand 
the sensitivity of the problem with respect to $\alpha$. For instance, we can compute the 
derivative of the PageRank vector with respect to $\alpha$. It is also used to extrapolate 
solutions to accelerate PageRank methods. More recently, varying $\alpha$ was used to show 
a relationship between personalized-PageRank-like vectors and spectral clustering [23]. 

Note that PageRank solution paths as $\alpha$ varies would be an equally interesting parameter 
regime to analyze. The parameter $\alpha$ functions akin to $\varepsilon$ in that large values of $\alpha$ cause 
the diffusion to propagate further in the graph. 
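
The qualitative effect of the teleportation parameter can be checked numerically on a small example. The Python sketch below is only an illustration under assumptions: the function name `seeded_pagerank`, the path graph, the choice of seed, and the cutoff for what counts as "far" are all hypothetical. It computes the seeded PageRank vector by the fixed-point iteration $x \leftarrow \alpha P x + (1 - \alpha) s$ with $P = A D^{-1}$ and reports how much mass lies more than five steps from the seed.

```python
def seeded_pagerank(adj, seed, alpha, iters=5000):
    """Seeded PageRank via the iteration x <- alpha * P x + (1 - alpha) * s,
    where P = A D^{-1} is the random-walk matrix and s is the seed indicator.
    The error after `iters` steps is at most alpha**iters."""
    n = len(adj)
    x = [0.0] * n
    for _ in range(iters):
        nxt = [0.0] * n
        nxt[seed] = 1.0 - alpha
        for v in range(n):
            share = alpha * x[v] / len(adj[v])
            for u in adj[v]:
                nxt[u] += share
        x = nxt
    return x


# Path graph on 20 nodes, seeded at one end (a toy example).
n = 20
path = [[u for u in (v - 1, v + 1) if 0 <= u < n] for v in range(n)]

far_mass = {}
for alpha in (0.5, 0.85, 0.99):
    x = seeded_pagerank(path, 0, alpha)
    far_mass[alpha] = sum(x[v] for v in range(6, n))  # mass beyond distance 5
    print(alpha, far_mass[alpha])
```

On this example the mass far from the seed grows as $\alpha$ increases, which is the sense in which larger $\alpha$ lets the diffusion reach further into the graph.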


8 Conclusions and discussion 


We proposed two algorithms that utilize the push step in new ways to generate refined 
insights on the behavior of diffusions in networks. The first is a method to rapidly 
estimate the degree-normalized PageRank solution path as a function of the tolerance 
$\varepsilon$. This method is slower than estimating the solution of a single diffusion in absolute 
run time, but still fast enough for use on large graphs. We designed that method, and 
the associated degree-normalized PageRank solution path plot, in order to reveal new 
insights about regions at different size-scales in large networks. The second method is a 
fast approximation to the solution path on a grid of logarithmically-spaced $\varepsilon$ values. It 
uses an interesting application of bucket sort to efficiently manage these diffusions. We 
demonstrate that both of these algorithms are fast and local on large networks. 

The seeded PageRank solution plots, in particular, are effective at identifying a number 
of subtle structures that emerge as a diffusion propagates from a set of seed nodes to the 
remainder of the network. We hope that these become useful tools to diagnose and study 
the properties of large networks. 

As recently established by Ghosh et al. [10], there are many related diffusion methods 
that all share Cheeger-like inequalities for specific definitions of conductance. We anticipate 
that our solution path algorithm could apply to any of these diffusions as well. For instance, 
our recent result on estimating the heat kernel diffusion in large graphs is based on the 
push step as well [18]; we anticipate only mild difficulty in adapting our results to that 
diffusion. 

Fast access to the solution path trajectories provides a number of additional opportunities 
that we have not yet explored. We may be able to track multiple clusters directly by 
managing intermediate data. We may be able to find near-optimal conductance sets that 
are larger than those that directly optimize the objective. Also, nodes in an egonet or 
larger set could be further clustered by properties of their solution paths instead of their 
connectivity patterns. 